In this workshop, we will explore the advanced features of the Nextflow language and runtime, and learn how to use them to write efficient and scalable data-intensive workflows.
We will cover topics such as parallel Channels, Processes, and Operators.
We now know how to create and use Channels to send data around a workflow. We will now see how to run tasks within a workflow using processes.
A process is the way Nextflow executes commands you would run on the command line or custom scripts.
The syntax is defined as follows:
process < NAME > {
[ directives ]
input:
< process inputs >
output:
< process outputs >
when:
< condition >
[script|shell|exec]:
< user script to be executed >
}
For example:
Example:cat bin/example_process.nf
#!/usr/bin/env nextflow
params.input = "/home/sinrasu/sinrasu-test/*_{1,2}.fastq.gz"
process FASTQC {
tag "$meta.id"
cpus 2
memory 1.GB
conda "bioconda::fastqc=0.11.9"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0' :
'biocontainers/fastqc:0.12.1--hdfd78af_0' }"
input:
tuple val(meta), path(reads)
output:
tuple val(meta), path("*.html"), emit: html
tuple val(meta), path("*.zip") , emit: zip
script:
"""
fastqc $reads
"""
}
workflow {
sample_ch = Channel.fromFilePairs(params.input, checkIfExists: true)
sample_ch.view()
FASTQC(sample_ch)
}
nextflow run bin/example_collect.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_collect.nf` [silly_ride] DSL2 - revision: 0eef816596
## ['SRR6357076', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357076_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357076_2.fastq.gz], 'SRR6357071', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357071_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357071_2.fastq.gz], 'SRR6357070', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357070_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357070_2.fastq.gz], 'SRR6357072', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357072_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357072_2.fastq.gz]]
Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantic of the task itself.
Directives are commonly used to define the amount of computing resources to be used or other meta directives that allow the definition of extra configuration of logging information.
List of directives
| Name | Description |
|---|---|
cpus |
Allows you to define the number of (logical) CPUs required by the process’ task. |
time |
Allows you to define how long the task is allowed to run (e.g., time 1h: 1 hour, 1s 1 second, 1m 1 minute, 1d 1 day). |
memory |
Allows you to define how much memory the task is allowed to use (e.g., 2 GB is 2 GB). Can also use B, KB,MB,GB and TB. |
disk |
Allows you to define how much local disk storage the task is allowed to use. |
tag |
Allows you to associate each process execution with a custom label to make it easier to identify them in the log file or the trace execution report. |
publishDir |
Allows you to save important, non-intermediary, and/or final files in a results folder. |
In this chapter, we take a curated tour of the Nextflow operators. Commonly used and well understood operators are not covered here - only those that we’ve seen could use more attention or those where the usage could be more elaborate. These set of operators have been chosen to illustrate tangential concepts and Nextflow features.
The flatten operator transforms a channel in such a way that every tuple is flattened so that each entry is emitted as a sole element by the resulting channel.
Example:cat bin/example_flatten.nf
input_ch = Channel.from(["path/file1.fastq", "path/file2.fastq"], ["path/file3.fastq", "path/file4.fastq"], ["path/file5.fastq", "path/file6.fastq"])
input_ch
.flatten()
.view()
nextflow run bin/example_flatten.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_flatten.nf` [high_bardeen] DSL2 - revision: d714f8acfd
## path/file1.fastq
## path/file2.fastq
## path/file3.fastq
## path/file4.fastq
## path/file5.fastq
## path/file6.fastq
The collect operator collects all of the items emitted by a channel in a list and returns the object as a sole emission.
Example:cat bin/example_collect.nf
workflow {
sample_ch = Channel.fromFilePairs("./data/reads/*_{1,2}.fastq.gz", checkIfExists:true)
.collect()
sample_ch.view()
}
nextflow run bin/example_collect.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_collect.nf` [shrivelled_bohr] DSL2 - revision: 0eef816596
## ['SRR6357076', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357076_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357076_2.fastq.gz], 'SRR6357071', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357071_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357071_2.fastq.gz], 'SRR6357070', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357070_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357070_2.fastq.gz], 'SRR6357072', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357072_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357072_2.fastq.gz]]
The groupTuple operator collects tuples (or lists) of values emitted by the source channel, grouping the elements that share the same key. Finally, it emits a new tuple object for each distinct key collected.
Example:cat bin/example_groupTuple.nf
ch = channel
.of( ['wt','wt_1.fq'], ['wt','wt_2.fq'], ["mut",'mut_1.fq'], ['mut', 'mut_2.fq'] )
.groupTuple()
.view()
nextflow run bin/example_groupTuple.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_groupTuple.nf` [stoic_tuckerman] DSL2 - revision: 14dfed2105
## [wt, [wt_1.fq, wt_2.fq]]
## [mut, [mut_1.fq, mut_2.fq]]
The branch operator allows you to forward the items emitted by a source channel to one or more output channels.
The selection criterion is defined by specifying a closure that provides one or more boolean expressions, each of which is identified by a unique label. For the first expression that evaluates to a true value, the item is bound to a named channel as the label identifier. For example:
cat bin/example_branch.nf
workflow {
params.input = "data/samplesheet.csv"
Channel.fromPath(params.input)
.splitCsv(header: true)
.map{ row -> [[sample:row.sample,strand:row.type],[row.fastq_1,row.fastq_2]]}
.branch{ meta, reads ->
single: meta.strand == "single"
paired: meta.strand == "paired"
}
.set { samples }
samples.single.view()
}
nextflow run bin/example_branch.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_branch.nf` [jovial_heyrovsky] DSL2 - revision: 8b1f242980
## [[sample:WT_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz]]
## [[sample:WT_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz]]
## [[sample:WT_REP2, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz]]
## [[sample:RAP1_IAA_30M_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz]]
A common Nextflow pattern is for a simple samplesheet to be passed as primary input into a workflow. We’ll see some more complicated ways to manage these inputs later on in the workshop, but the splitCsv (docs) is an excellent tool to have in a pinch. This operator will parse a csv/tsv and return a channel where each item is a row in the csv/tsv:
cat bin/example_splitCsv.nf
workflow {
params.input = "https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/samplesheet/v3.10/samplesheet_test.csv"
Channel.fromPath(params.input)
.splitCsv(header: true)
.view()
}
Output:
nextflow run bin/example_splitCsv.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_splitCsv.nf` [cheesy_leibniz] DSL2 - revision: 5650611e54
## [sample:WT_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz, strandedness:auto]
## [sample:WT_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz, strandedness:auto]
## [sample:WT_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_IAA_30M_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz, strandedness:reverse]
The multiMap operator is a way of taking a single input channel and emitting into multiple channels for each input element.
Example:
cat bin/example_multiMap.nf
Channel.of( 1, 2, 3, 4, 5 )
.multiMap {
small: it
large: it * 10
}
.set { numbers }
numbers.small | view { num -> "Small: $num"}
numbers.large | view { num -> "Large: $num"}
nextflow run bin/example_multiMap.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_multiMap.nf` [confident_cray] DSL2 - revision: 3d406c0ad2
## Large: 10
## Large: 20
## Large: 30
## Large: 40
## Large: 50
## Small: 1
## Small: 2
## Small: 3
## Small: 4
## Small: 5
The exercises below are designed to strengthen your knowledge in Nextflow more. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.
Your Nextflow module should include the following:
Input parameters for specifying the input data (e.g., aligned BAM files).
script should contain any two functions from samtools.
setup the process to run on minimal no.of cpus and tag the process with sample ids
sample
1 WT_REP1
2 WT_REP1
3 WT_REP2
4 RAP1_UNINDUCED_REP1
5 RAP1_UNINDUCED_REP2
6 RAP1_UNINDUCED_REP2
7 RAP1_IAA_30M_REP1
fastq_1
1 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
2 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
3 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
4 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
5 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
6 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
7 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
fastq_2
1 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
2 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
3 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
4
5
6
7 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
strandedness
1 auto
2 auto
3 reverse
4 reverse
5 reverse
6 reverse
7 reverse